Final Report
Introduction
Economic output and population health are deeply connected. Gross domestic product (GDP) per capita (expressed in 2021 dollars) reflects a country’s ability to invest in things like education, sanitation, and medical care for its citizens. Life expectancy at birth reflects the cumulative impact of those investments on survival and lifespan. Evidence from previous studies suggests that people in wealthier societies live longer (Miladinov 2020), and the Preston Curve (Preston 1975), a well-known graphical representation of the relationship between a country’s GDP per capita and its average life expectancy, supports this view. However, how strong and consistent this pattern has been over the last 200 years remains an open question.
This report investigates the relationship between GDP per capita and life expectancy for 195 countries using annual data provided by Gapminder. We create a country-year data set, visualize historical patterns, estimate a log-linear model, and assess out-of-sample predictive performance using k-fold cross-validation.
Hypothesis
The research hypothesis is that higher economic output per person is associated with longer life expectancy. We predict a positive and moderately strong relationship, in which a 10x increase in GDP per capita corresponds to a 10 to 15 year gain in life expectancy, similar to the slope of the Preston Curve.
Data Cleaning
Sources
- Life expectancy at birth - The average number of years a newborn is expected to live, given age‑specific mortality rates (Gapminder 2024b).
- GDP per capita (2021 dollars) - The inflation-adjusted economic output divided by mid‑year population (Gapminder 2024a).
Processing Workflow
- Reshaping data - Converted wide spreadsheets (years in columns) to long format (one row = country‑year)
- Standardizing GDP values - Translated abbreviated entries such as “10k” to numeric 10,000
- Merging fields - Inner join on country and year keeps observations with both variables present
- Filtering missing and implausible data - Dropped rows with blanks and excluded life‑expectancy values <0 or >120 years.
- Filtering historical data - Restricted to years ≤2024 to avoid model‑based future projections.
These processing steps prioritize data quality over a larger sample size. One caveat is that excluding partially observed countries can bias the results toward countries with more observations, which is an issue we will address.
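The processing workflow above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the code used for the report: the function names, the in-memory row layout, and the `M` and `B` suffixes are assumptions (the report itself only mentions entries such as “10k”).

```python
def parse_abbrev(value):
    """Turn strings such as '10k' into floats; return None for blanks."""
    text = str(value).strip()
    if not text:
        return None
    # 'k' appears in the report; 'M' and 'B' are assumed for illustration.
    multipliers = {"k": 1e3, "M": 1e6, "B": 1e9}
    if text[-1] in multipliers:
        return float(text[:-1]) * multipliers[text[-1]]
    return float(text)


def wide_to_long(rows):
    """Reshape wide rows (one column per year) into {(country, year): value}."""
    long = {}
    for row in rows:
        for key, raw in row.items():
            if key == "country":
                continue
            value = parse_abbrev(raw)
            if value is not None:  # drop blank cells
                long[(row["country"], int(key))] = value
    return long


def clean(life_rows, gdp_rows, max_year=2024):
    """Inner-join both tables on (country, year) and apply the filters."""
    life = wide_to_long(life_rows)
    gdp = wide_to_long(gdp_rows)
    combined = []
    for country, year in sorted(life.keys() & gdp.keys()):  # inner join
        le = life[(country, year)]
        if year <= max_year and 0 <= le <= 120:
            combined.append({"country": country, "year": year,
                             "life_exp": le, "gdp": gdp[(country, year)]})
    return combined


# Hypothetical toy input: country B has a blank cell, 2030 is a projection.
life_rows = [{"country": "A", "2020": "71.5", "2030": "75"},
             {"country": "B", "2020": ""}]
gdp_rows = [{"country": "A", "2020": "10k", "2030": "12k"},
            {"country": "B", "2020": "900"}]
cleaned = clean(life_rows, gdp_rows)
print(cleaned)
```

In this toy input only the row for country A in 2020 survives, because country B has a blank life-expectancy cell and 2030 falls after the 2024 cutoff.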
Methods
Visualization
- Since GDP is heavily skewed, we plot it on a base-10 logarithmic scale. Averaging each country’s values across all years yields Figure 1 (static cross‑section). An animated version (Figure 2) traces yearly trajectories from 1800 to 2024 and shows a drift toward higher prosperity and longevity with widening dispersion.
Statistical Model
Let \(\mathrm{LE}_i\) represent average life expectancy and \(\mathrm{GDP}_i\) represent average GDP per capita for country \(i\). Motivated by Figure 1, we fit the ordinary-least-squares model:
\[ \widehat{\mathrm{LE}}_i = \beta_0 + \beta_1 \log_{10}(\mathrm{GDP}_i). \tag{1} \]
The slope \(\beta_1\) represents the expected change in life expectancy from a 10x GDP increase.
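As a rough illustration of how model (1) can be estimated, the sketch below applies the closed-form least-squares formulas for a single predictor. The function name `fit_log_linear` and the three data points are hypothetical; the points are made up to lie exactly on a known line so the recovered coefficients are easy to check, and they are not Gapminder values.

```python
import math


def fit_log_linear(gdp, life_exp):
    """Return (beta0, beta1) for LE = beta0 + beta1 * log10(GDP) via OLS."""
    x = [math.log10(g) for g in gdp]
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(life_exp) / n
    # Closed-form simple-regression estimates: slope = Sxy / Sxx.
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, life_exp))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    beta1 = sxy / sxx
    beta0 = y_mean - beta1 * x_mean
    return beta0, beta1


# Made-up points on the line LE = 10 + 15 * log10(GDP).
gdp = [1_000, 10_000, 100_000]
life_exp = [55.0, 70.0, 85.0]
beta0, beta1 = fit_log_linear(gdp, life_exp)
print(f"intercept = {beta0:.2f}, slope = {beta1:.2f}")
```

Because the predictor is on a base-10 log scale, moving one unit along it corresponds to multiplying GDP by 10, which is why the slope reads directly as years per 10x increase.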
Model-fit diagnostics
R² - Variance of the fitted values divided by variance of the observations
k-fold cross-validation - With \(k = 19\) (≈10 countries/fold), refit on \(k - 1\) folds, evaluate R² on the held-out fold, and repeat for every fold.
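The cross-validation procedure described above might look roughly like the following sketch. It uses the variance-ratio definition of R² given in this section, synthetic data, and a smaller \(k\) than the report. The helper names, the random seeds, and the noise level are all assumptions made for illustration.

```python
import random
from statistics import pvariance


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    x_mean, y_mean = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_mean - slope * x_mean, slope


def variance_ratio_r2(fitted, observed):
    """R² as defined in this section: var(fitted) / var(observed)."""
    return pvariance(fitted) / pvariance(observed)


def k_fold_r2(x, y, k, seed=0):
    """Return one held-out R² per fold."""
    indices = list(range(len(x)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k roughly equal folds
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        intercept, slope = fit_line([x[i] for i in train],
                                    [y[i] for i in train])
        fitted = [intercept + slope * x[i] for i in held_out]
        scores.append(variance_ratio_r2(fitted, [y[i] for i in held_out]))
    return scores


# Synthetic stand-in for 20 countries: log10(GDP) and noisy life expectancy.
rng = random.Random(1)
log_gdp = [2.5 + 0.1 * i for i in range(20)]
life_exp = [10 + 15 * x + rng.gauss(0, 3) for x in log_gdp]
scores = k_fold_r2(log_gdp, life_exp, k=5)
mean_r2 = sum(scores) / len(scores)
print([round(s, 2) for s in scores], round(mean_r2, 2))
```

Note that a variance ratio computed on held-out data is not bounded above by 1, unlike the in-sample case, so individual fold scores can exceed 1.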
Results
Estimated relationship
\[ \widehat{\mathrm{LE}} = \mathbf{-5.23} \ \text{years} \ +\ \mathbf{14.20} \log_{10}(\mathrm{GDP}). \]
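Slope - \(\beta_1 = 14.20\). Each 10x increase in GDP per capita is associated with ≈14 more years of life expectancy (\(p < 0.001\))
Intercept - The constant of \(-5.23\) years is the prediction at \(\log_{10}(\mathrm{GDP}) = 0\), i.e. a GDP per capita of 1 dollar, which lies outside the observed GDP range, so it is not meaningful in this context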
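To make the fitted equation concrete, the short sketch below plugs a few round GDP values into it. The coefficients are the ones reported above; the function name and the chosen GDP values are only illustrative.

```python
import math

BETA0 = -5.23   # intercept reported in the Results (years)
BETA1 = 14.20   # slope reported in the Results (years per 10x GDP)


def predicted_life_expectancy(gdp_per_capita):
    """Evaluate the fitted line at a given GDP per capita."""
    return BETA0 + BETA1 * math.log10(gdp_per_capita)


for gdp in (1_000, 10_000, 100_000):
    print(f"{gdp:>7,} -> {predicted_life_expectancy(gdp):.2f} years")
```

The three predictions come out at about 37.4, 51.6, and 65.8 years, so each step of 10x in GDP adds the slope of 14.20 years.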
Goodness-of-fit
Variance decomposition for model (1).
| Component | Value |
|---|---|
| Total variance (A) | 58.55 |
| Explained variance (B) | 36.66 |
| Residual variance (C) | 21.89 |
| R² = B/A | 0.63 |
The model explains 63% of the differences in life expectancy between countries, which is a strong result considering it only uses GDP per capita.
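As a quick arithmetic check on the reported variance components, the explained and residual parts should add up to the total, and their ratio should reproduce the stated R²:

\[ A = B + C = 36.66 + 21.89 = 58.55, \qquad R^2 = \frac{B}{A} = \frac{36.66}{58.55} \approx 0.626 \approx 0.63. \]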
Cross‑validation performance
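The mean out‑of‑sample R² is 0.85, which exceeds the in‑sample value of 0.63. All 19 folds show positive R² values (range ≈0.30–1.05), which suggests that the model generalizes and shows no evidence of over‑fitting. Because R² is computed here as the ratio of fitted-value variance to observed variance, it can exceed 1 on a held-out fold, as the upper end of the range shows.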
Discussion of results
The results show a clear link between higher GDP per capita and longer life expectancy over the last 200 years. Even with this simple model, we can explain about two-thirds of the differences in life expectancy between countries and make reasonable predictions for countries that were not used in training.
Limitations
Based on the data and real-world dynamics, some limitations include:
1. Patterns that hold for a country may not hold for an individual. For example, a family that is better off than most others in its country would likely have an expected lifespan that is an outlier.
2. Other important determinants of lifespan, such as general education, quality of health care, and government prioritization of health, were not included in the data.
3. Major historical events such as the Covid-19 pandemic could have shifted this relationship, but we have not yet seen their effects.
Conclusion
Throughout the last 200 years, people in richer countries have lived longer. A 10x increase in GDP per capita is linked to an increase in life expectancy of about 14 years. While a country’s economy is not the only factor that affects life expectancy, GDP per capita appears to be a useful measure of overall prosperity and plays an important role in how long its citizens live. Further research could add more variables, such as spending on education or healthcare, test whether the relationship between GDP and life expectancy differs across regions of the world, or examine whether a more complex pattern fits the data better.